Evaluation of a hybrid pipeline for automated segmentation of solid lesions based on mathematical algorithms and deep learning

We evaluate the accuracy of an original hybrid segmentation pipeline, combining variational and deep learning methods, in the segmentation of CT scans of stented aortic aneurysms and abdominal organs, and of MR scans of brain lesions. The hybrid pipeline was trained on 50 aortic CT scans and tested on 10. Additionally, we trained and tested the hybrid pipeline on publicly available datasets of CT scans of abdominal organs and MR scans of brain tumours. We tested the accuracy of the hybrid pipeline against a gold standard (manual segmentation) and compared its performance to that of a standard automated segmentation method with commonly used metrics, including the DICE, JACCARD and volumetric similarity (VS) coefficients, and the Hausdorff Distance (HD). Results. The hybrid pipeline produced very accurate segmentations of the aorta, with mean DICE, JACCARD and VS coefficients of 0.909, 0.837 and 0.972 for thrombus segmentation and 0.937, 0.884 and 0.970 for stent and lumen segmentation. It consistently outperformed the standard automated method. Similar results were observed when the hybrid pipeline was trained and tested on publicly available datasets, with mean DICE scores of 0.832 on brain tumour segmentation, and 0.894/0.841/0.853/0.847/0.941 on left kidney/right kidney/spleen/aorta/liver organ segmentation.

www.nature.com/scientificreports/ algorithms and for evaluating their performance. To address this problem, hybrid approaches to segmentation have been proposed, whereby variational methods are used to supplement manual ground truth labels [7][8][9] , thus reducing the demand for manual hand drawn segmentation. We have developed our own fully automated hybrid approach (a hybrid segmentation pipeline) for volumetric segmentation, which combines a recently developed variational model 10 with a deep learning algorithm. In this work, we tested the accuracy of this hybrid segmentation pipeline.

Methods
Reference standards. To test our pipeline, we chose contrast-enhanced computed tomography (CT) angiograms of patients being followed up after endovascular repair (EVAR) of abdominal aortic aneurysms (AAAs), as the stented aorta is an example of a solid lesion that requires serial follow-up with cross-sectional imaging. We chose post-EVAR CTs because serial comparison of scans is a common diagnostic problem, which can be time-consuming and labour-intensive. The choice was also due, in part, to familiarity (three authors are vascular interventionists), and to the fact that our group had previously used and tested a reproducible manual segmentation technique on these scans 11,12 . CTs were performed with a 64-slice Siemens Somatom scanner (Siemens Healthcare, Frimley, UK). Manual segmentation was performed by one of the authors (MH) on reconstructed 2 mm thick slices with intervals of 2 mm, according to a previously described technique [11][12][13] .
We acquired a total of 70 fully anonymised postoperative CTs, of which 50 were manually segmented to provide a "ground truth" for training the deep learning algorithm, and to provide useful evaluation metrics. The manual segmentation ("ground truth labelling") was conducted between the lowermost renal artery and the aortic bifurcation, by hand, using an open source application called ITK-SNAP 14 . This segmentation was considered the reference standard.
We split the 50 manually segmented sets using a typical 60:20:20 ratio for training, validation and testing: 30 sets to train the deep learning part, 10 sets to validate and prevent overfitting, and the final 10 sets reserved for evaluation purposes. Our pipeline also used the 20 unlabelled datasets during the training phase, providing us with 50 volumes in total during training.
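The arithmetic of this split can be checked with a short sketch. The helper name `split_cases`, the case identifiers, the random assignment and the fixed seed are all assumptions made for illustration; the text does not state how individual scans were assigned to each subset.

```python
import random


def split_cases(labelled, unlabelled, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split labelled cases into train/validation/test and add unlabelled cases to training.

    Hypothetical helper: random assignment and the seed are illustrative assumptions.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    cases = list(labelled)
    random.Random(seed).shuffle(cases)  # reproducible shuffle of the labelled cases
    n_train = round(ratios[0] * len(cases))
    n_val = round(ratios[1] * len(cases))
    return {
        "train_labelled": cases[:n_train],
        "train_unlabelled": list(unlabelled),  # used only during training
        "validation": cases[n_train:n_train + n_val],
        "test": cases[n_train + n_val:],
    }


# Counts taken from the text: 50 manually segmented scans and 20 unlabelled ones.
labelled_ids = [f"labelled_{i:02d}" for i in range(50)]
unlabelled_ids = [f"unlabelled_{i:02d}" for i in range(20)]

split = split_cases(labelled_ids, unlabelled_ids)
sizes = {name: len(ids) for name, ids in split.items()}
print(sizes)
```

With the counts given in the text this yields 30 labelled and 20 unlabelled training volumes (50 in total), 10 validation volumes and 10 test volumes.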
Additionally, we evaluated the pipeline on two publicly available datasets: The Brain Tumour Segmentation challenge (BraTS) [15][16][17] , and the Abdomen data from the Multi-Atlas Labelling Beyond the Cranial Vault challenge 18 . The BraTS dataset contains a range of MR modalities, but for our purposes we considered only the fluid attenuated inversion recovery (FLAIR) volumes. All sets have been labelled by one to four raters following the same protocol, and their annotations approved by experienced neuro-radiologists. We segmented only the tumour region in each volume. We used a total of 200 volumes for the BraTS data: 120 during training, 40 for validation and 40 for evaluation purposes. In addition to the BraTS, we used the Abdomen dataset, a collection of CT scans of the abdomen in which 13 organs have been segmented by two experienced undergraduate students and the segmentation verified by a radiologist. We evaluated our pipeline on five of the 13 organs: the spleen, the right kidney, the left kidney, the liver and the aorta. The Abdomen dataset contains 30 scans; of these, we used 15 for training, 5 for validation and 10 for evaluation.

Pipeline tests.
After windowing and selecting the uppermost and lowermost slice for segmentation on either a post-EVAR aneurysm or a selected organ from the BraTS or Abdomen datasets, we ran our variational model 10 . This original model uses an enhanced method of edge detection, which allows for images containing low contrast to be segmented effectively. Following edge detection, the region of interest is segmented based on image intensity and pixel location in the image 10 . The variational method provided us with a good but not perfect initial segmentation, as some regions may contain no contrast at the boundary. Furthermore, artifacts may result in poor definition in certain areas.
Although it is possible to obtain accurate segmentation by using the variational method only, this is a time-consuming process, taking up to 20 min for a large volume. Further, each new volume would require a user to manually insert a set of markers to indicate the region of interest. We eliminated this step entirely by using image registration. In practice, we ran the variational segmentation model for one 3D volume (aneurysm, tumour or organ), obtaining an initial segmentation. For each subsequent scan we simply superimposed this segmentation onto the new scan, thereby registering the saved segmentation onto the new image, rather than re-running the variational method on each new image. The overlapped images were then registered by a previously trained network 19 , which, although not entirely accurate, produced an estimate that could be fed to the CNN to produce an accurate final result. The volume used for the initial segmentation was not excluded and remained in the training set. This registration step removed the need for user interaction, making the method automatic.
The final stage involved feeding each estimated segmented volume to the CNN. Unlike commonly used CNN-based segmentation methods 4,6 , our CNN received both the scan and the estimated segmented volume (the output of the variational model). This provided the CNN with supplementary information to produce an accurate final result. The CNN was trained in a standard way using backpropagation (see appendix for details).
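The difference between the two kinds of network input can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `standard_input` and `hybrid_input`, the flattened toy volume and the voxel values are hypothetical, and a real pipeline would operate on 3D arrays inside a deep learning framework.

```python
from typing import List, Sequence, Tuple


def standard_input(scan: Sequence[float]) -> List[Tuple[float]]:
    """One channel per voxel: the scan intensity only."""
    return [(float(v),) for v in scan]


def hybrid_input(scan: Sequence[float], estimate: Sequence[int]) -> List[Tuple[float, float]]:
    """Two channels per voxel: scan intensity and estimated segmentation label."""
    if len(scan) != len(estimate):
        raise ValueError("scan and estimate must cover the same voxels")
    return [(float(v), float(m)) for v, m in zip(scan, estimate)]


# Toy flattened volume of four voxels (values are made up for illustration).
scan = [0.10, 0.80, 0.75, 0.20]
estimate = [0, 1, 1, 0]  # binary estimate produced before the CNN stage

x_standard = standard_input(scan)
x_hybrid = hybrid_input(scan, estimate)
print(len(x_standard[0]), len(x_hybrid[0]))  # channels per voxel
```

The only change relative to a scan-only network is the extra channel, so the estimated segmentation acts as a spatial prior that the CNN can refine rather than a result it must produce from scratch.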
"Unlabelled" scans (i.e., image volumes without manual segmentation) were also used to train our pipeline. The estimated volume from the variational method was used here in place of the reference standard. This allowed us to expand our training set, exposing the network to more data without needing more time-consuming manual labeling. This is commonly known as a semi-supervised approach to learning (see appendix).
Data analysis and presentation. The accuracy of an automated segmentation pipeline depends on its ability to correctly identify all voxels, and hence the volume, belonging to an organ/lesion in a scan, as well as those that lie outside said organ/lesion. To evaluate the accuracy of our pipeline against manual segmentation we reported true positives (TP) as the number of voxels correctly identified by a segmentation method; false positives (FP) and false negatives (FN) as the number of voxels incorrectly identified/excluded by a segmentation method; sensitivity as the ability of a model to correctly identify all relevant voxels or volume; and false negative rate (the complement of sensitivity, i.e. 1 − sensitivity). We did not report true negatives, as these are largely dependent on the total number of voxels included in a scan (usually a whole section of the body), hence values are always very high. Continuous variables were expressed as median and range, as they were generally not normally distributed. Correlation was evaluated by graphical methods and agreement with Bland-Altman plots 20 . We also used more commonly used segmentation metrics to evaluate the pipeline, including the DICE, the JACCARD and the volumetric similarity (VS) coefficients, which are defined as:
$$\mathrm{DICE} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{JACCARD} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{VS} = 1 - \frac{|FN - FP|}{2\,TP + FP + FN}.$$
Finally, the Hausdorff Distance (HD) between a segmented volume X and a ground truth segmentation GT is defined as:
$$\mathrm{HD}(X, GT) = \max\{\, h(X, GT),\; h(GT, X) \,\},$$
where h(X, GT) is the directed Hausdorff distance given by:
$$h(X, GT) = \max_{x \in X} \, \min_{y \in GT} |x - y|,$$
where |x − y| is the Euclidean distance between two points, where x is in X and y is in GT.
Both DICE and JACCARD scores range between 0 (where no overlap between output and reference occurs) and 1 (where the output is exactly the reference segmentation). VS represents how similar the volume of the segmented output is to the volume of the reference segmentation. This is not influenced by the overlap: two organs with the exact same volume but in different positions would result in a VS score of 1. Finally, HD describes the largest distance from one volume to the nearest point in the other. Unlike the other metrics described here, the ideal result of HD is a score of 0, as the voxels of an ideal segmentation would be in the same place as those of the reference standard.
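The behaviour of these metrics can be illustrated with a small sketch based on their standard definitions in terms of TP, FP and FN. Representing each segmentation as a set of voxel coordinates, the function names and the toy volumes are assumptions made for illustration; distances are in voxel units, and the brute-force Hausdorff computation is only practical for small examples.

```python
from math import dist


def overlap_metrics(pred, ref):
    """Overlap metrics for two segmentations given as sets of voxel coordinates."""
    tp = len(pred & ref)  # voxels present in both segmentations
    fp = len(pred - ref)  # voxels only in the predicted segmentation
    fn = len(ref - pred)  # voxels only in the reference segmentation
    denom = 2 * tp + fp + fn  # equals |pred| + |ref|
    if denom == 0:
        raise ValueError("both segmentations are empty")
    return {
        "dice": 2 * tp / denom,
        "jaccard": tp / (tp + fp + fn),
        "vs": 1 - abs(fn - fp) / denom,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
    }


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b (both non-empty)."""
    return max(min(dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy example: a 2x2 block of voxels and the same block shifted by one voxel.
ref = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 1), (0, 0, 2), (0, 1, 1), (0, 1, 2)}

m = overlap_metrics(pred, ref)
hd = hausdorff(pred, ref)
print(m, hd)
```

In this example the two volumes are equal in size but only half overlap, so DICE is 0.5 and JACCARD is 1/3 while VS remains 1, and the HD is one voxel. This mirrors the point made above that VS is insensitive to position.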
In order to compare our pipeline to a more traditional method of automated segmentation, we trained a deep learning model in a typical fashion (using only the scan as input, without the input of our variational model). This method is referred to as "standard" in our results.
As further experiments, we trained the hybrid method with progressively smaller training sets in order to determine whether the hybrid approach can perform better than the standard method when provided with less data. In addition, we ran the hybrid method on some simplified networks with lighter architectures (see appendix).
Ethical approval. The study was conducted in accordance with relevant institutional guidelines and regulations.

Results and discussion
Results. Aortas. Figure 1 displays a 2D example output of both approaches, showing various cross-sections of a 3D volume. On manual segmentation, the median (range) thrombus volume was 185 (129-531) ml, corresponding to 179,303 (111,336-588,375) voxels, whereas the stent and lumen volume was 59 (45-88) ml, corresponding to 55,717 (36,117-102,904) voxels. The accuracy of the "standard" and hybrid pipelines is reported in Tables 1 and 2. Notably, the hybrid method had slightly more TPs, fewer FPs and more FNs, resulting in higher DICE, JACCARD and VS for the whole aneurysm and the thrombus. Correlation and agreement between the measurements of the two pipelines are displayed in Fig. 2.
Abdomen/BraTS. Results of the segmentation of abdominal organs are displayed in Tables 5 and 6. For all organs, the hybrid pipeline was more accurate than the standard, producing more true positives and fewer false positives or negatives than the standard one, except for the spleen, in whose segmentation the hybrid pipeline resulted in a higher median number of false negatives. Similarly, the hybrid pipeline also outperformed the standard one in the segmentation of brain tumours (Tables 3, 4).

Discussion.
Our results suggest that the combination of a mathematical (variational) approach with a deep learning algorithm may improve the accuracy of automated segmentation of organs and tumours on CT and MR images.
For general purposes, variational and deep learning segmentation have been studied widely, though limited work has been conducted into hybrid approaches [7][8][9] . Recently, deep learning has become the preferred approach to segmentation and can produce outstanding results on a wide variety of applications 21-23 . Wang et al. 24,25 used a similar approach to ours, albeit fully based on deep learning. In their proposed method, an initial network produces an initial segmentation; if this is not satisfactory, a second network is used to refine the result. Our proposed method is similar, but, in place of a network, the first step is a variational algorithm, which we recently developed specifically to improve segmentation in the presence of low contrast 10 . Use of the variational method is innovative, and it allows us greater control over the initial segmentation when compared with the work by Wang et al. 24,25 . Further, a major weakness of deep learning methods is their heavy reliance on large, labelled datasets. As we can rely on the variational method to provide a reasonably good initial segmentation, we can also use unlabelled data in the training stage, thus expanding our training dataset. The output segmentation should therefore be more resilient to variation.
Aortic segmentation. Automated segmentation of the abdominal aorta has been addressed before with some success. Early approaches include variational model-based methods such as level set methods (a mathematical way of representing a shape) by Loncaric et al. 26 and Subasic et al. 27,28 , whereas Zohios et al. 29 used level sets and geometrical methods to segment the thrombus in the presence of calcifications. These models can be very time-consuming, and susceptible to imprecision where low contrast is present. More recently Lalys et al. 30 proposed a fast 3D model based on the snakes model by Kass et al. 31 , capable of segmenting both the lumen and thrombus but requiring some user input. Although they quoted a mean DICE score of 0.87 on postoperative CTA scans, a shape-based deformable model that relies on image registration to achieve segmentation can perform poorly on unusual scans. Based on a similar idea, Lareyre et al. 32 proposed a fully automated pipeline for segmentation of AAAs, obtaining both lumen and thrombus and incorporating the Chan-Vese model 2 . Evaluation of thrombus segmentation was performed on 525 selected slices from 40 CT scans, giving a mean DICE of 0.88, Jaccard of 0.80 and sensitivity of 0.91, slightly worse than our pipeline's mean results of 0.91, 0.84 and 0.92 respectively.
Deep learning methods include that by Lopez et al. 33 , who developed a fully automatic approach for segmenting the thrombus on CT scans of patients treated with endovascular aneurysm repair (EVAR) using a new network architecture based on the work by Xie et al. 34 . In a follow-up work 35 , the authors extended their method to segment 3D volumes, maintaining the fully automatic aspect. Notably, segmentation was performed both on preoperative and postoperative CT scans, with a mean DICE score of 0.89 and Jaccard of 0.81 on segmentation of the whole aneurysm on postoperative scans. Lu et al. 36 proposed a 3D pipeline using a V-Net architecture 37 combined with an ellipse fitting to estimate the maximum diameter of the aorta. Both contrast and non-contrast enhanced scans were used, quoting DICE scores of 0.89 and 0.90 on preoperative scans with and without contrast respectively. More recently, Caradu et al. 38 proposed a deep learning algorithm trained to segment preoperative infrarenal aortic aneurysm CT volumes effectively, with a mean DICE score of 0.95 on 100 scans. Adam et al. 39 introduced an automated method named Augmented Radiology for Vascular Aneurysm (ARVA), trained on a large dataset of 489 CT volumes (a combination of both preoperative and postoperative scans), dedicated to segmenting the entire aorta from the ascending portion to the iliac arteries, with a mean DICE score of 0.95 on preoperative scans and 0.93 on postoperative scans. These results are comparable to ours, although ours were achieved with fewer training data. Our results therefore suggest that the use of an initial variational method can reduce the need for larger datasets.
Table 6. Performance on segmentation metrics of the Variational Method (VM) only, and Standard and Hybrid pipeline in organ segmentation. Units given as mean ± standard deviation.
BraTS. The BraTS dataset is widely used in the literature [40][41][42] . All scans include several MR sequences, including T1, post-contrast T1-weighted, T2-weighted and T2-FLAIR. Labels are also provided for the whole tumour, tumour core, and enhancing tumour regions. In order to simplify the experiments, we only segmented the T2-FLAIR sequence, which is commonly used in brain imaging. The BraTS dataset has attracted considerable attention: for example, Jiang et al. 40 developed a two-stage model using a U-Net 4 , utilising all sequences and using a post-processing thresholding technique. If the enhancing tumour region was smaller than a hand-tuned threshold, then the region was replaced with necrosis, which may significantly improve the results. An average DICE of 0.888 for tumours was reported. Zhao et al. 41 made use of a pipeline involving a CNN and a number of techniques including different methods of sampling, patch-based training and teacher-student models, resulting in a reported mean DICE score of 0.883. Ali et al. 42 exploited multiple CNNs trained separately, with final predictions based on ensembling the probability maps from each CNN, with a mean DICE of 0.906 for the whole tumour. These works developed specific pipelines with a particular focus on brain tumour segmentation, making use of all available sequences. Our proposed pipeline, which was not brain-specific, did not produce quite as good results but was based only on T2-FLAIR sequences. It is possible that its accuracy would have been greater if more sequences had been used. It is also possible that its performance on MR-acquired images and/or brain images may not be as good as that on CT-acquired images and/or aortic ones.
Abdominal organs. Gibson et al. 43 proposed a new network architecture for multi-organ segmentation in abdominal CT scans, reporting a mean DICE of 0.93, 0.95 and 0.95 for the left kidney, the spleen and the liver respectively. Cai et al. 44 developed a novel shape learning network architecture, building an expected shape for the organ into the model, and reported a mean DICE score of 0.96 and 0.94 for the spleen and liver respectively. Weston et al. 45 implemented a deep learning method aimed at segmenting the complete abdomen and pelvis using a locally collected dataset. A variation on the 3D U-Net was implemented, and a mean DICE score of 0.93, 0.93, 0.88 and 0.95 was reported for the kidneys (combined), spleen, aorta and liver respectively. These methods were developed specifically for abdominal segmentation. Although our pipeline was not quite as accurate, the quantitative results demonstrate the advantage of the hybrid pipeline over a conventional approach. A summary of the discussed methods from the literature and the proposed method can be found in Table 7.
The main limitation of our study is the limited testing of the hybrid pipeline. It is possible that its accuracy may not be reproducible when applied to other organs, to scans performed with different settings (for example, lower resolution), to patients with less well-defined boundaries between body structures on CT or MR, or in the presence of artefacts. Notably, all the above issues would also interfere with manual segmentation, which is still considered the gold standard. More extensive training/testing of the pipeline would be necessary to clarify its generalisability.

Conclusion
Our fully automatic segmentation pipeline, combining elements from both mathematical modelling and artificial intelligence, has been shown to be an accurate method for segmenting aortic 3D volumes and also performed well when applied to MR images of brain tumours and CT scans of abdominal organs.
We plan to test and refine the pipeline further on similar but not identical clinical tasks, such as segmentation of preoperative CT scans of the aorta. This could be achieved by using the current pipeline as a baseline and potentially modifying it to suit preoperative scans. Our segmentation pipeline provides the groundwork for developing a method for serial comparison of images, potentially reducing operator time and bias.
Ultimately, further testing on larger datasets will be necessary before attempting clinical experimentation and translation.